Sains Malaysiana 54(11)(2025): 2773-2783

http://doi.org/10.17576/jsm-2025-5411-16

 

Determination of the Optimal Number of PLS Components Based on the Combination of Cross-Validation and RMD-MRCD-PCA Weighting Function

(Penentuan Bilangan Komponen PLS yang Optimum berasaskan Gabungan Pengesahan Silang dan Fungsi Pemberat RMD-MRCD-PCA)

 

HABSHAH MIDI¹, SITI ZAHARIAH ABDUL WAHAB²,* & AZREE SHAHREL AHMAD NAZRI¹

¹Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia

²Malaysian Institute of Information Technology, Universiti Kuala Lumpur, 50250 Kuala Lumpur, Malaysia

 

Received: 19 November 2024/Accepted: 24 October 2025

 

ABSTRACT

Partial least squares (PLS) regression is a very useful tool for analysing high-dimensional data (HDD). Choosing the number of PLS components is a vital step in developing the best model, because selecting too many or too few components reduces the accuracy of the model. Numerous classical methods, such as leave-one-out cross-validation (LOOCV) and K-fold cross-validation (K-FoldCV), have been developed to determine the optimal number of PLS components. Nonetheless, they are easily affected by high leverage points (HLPs). Robust cross-validation techniques, denoted RMD-MRCD-PCA-LOOCV and RMD-MRCD-PCA-K-FoldCV, are therefore proposed to remedy this problem. The results of a simulation study and a real data set indicate that the proposed methods successfully select the appropriate number of PLS components.

Keywords: High leverage points; leave-one-out cross validation; minimum regularized covariance determinant; partial least squares; principal component analysis
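As an illustration of the classical procedure the abstract refers to, the sketch below selects the number of PLS components by minimising the K-fold cross-validated root mean squared error; setting K equal to the sample size gives LOOCV. It is a minimal sketch only: it assumes NumPy is available, uses a single-response NIPALS fit, and all names (`pls1_fit`, `kfold_rmsecv`) and the simulated data are hypothetical. It does not implement the RMD-MRCD-PCA weighting function, whose details are not given in this abstract.

```python
import numpy as np


def pls1_fit(X, y, n_comp):
    """Fit single-response PLS (NIPALS); return x-mean, y-mean, coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk                      # weight vector: covariance direction
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain in y
            break
        w = w / norm
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        c = yk @ t / tt                    # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)  # regression coefficients
    return x_mean, y_mean, beta


def kfold_rmsecv(X, y, max_comp, k=5, seed=0):
    """Cross-validated RMSE for 1..max_comp components (k = n gives LOOCV)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    rmse = []
    for a in range(1, max_comp + 1):
        sq_err = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            x_mean, y_mean, beta = pls1_fit(X[train_idx], y[train_idx], a)
            pred = y_mean + (X[test_idx] - x_mean) @ beta
            sq_err += np.sum((y[test_idx] - pred) ** 2)
        rmse.append(np.sqrt(sq_err / n))
    return np.array(rmse)


# Simulated high-dimensional data (assumption: two latent factors, p > n).
rng = np.random.default_rng(1)
n, p = 40, 100
T = rng.normal(size=(n, 2))
X = T @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = T @ np.array([1.5, -1.0]) + 0.1 * rng.normal(size=n)

rmsecv = kfold_rmsecv(X, y, max_comp=6, k=5)
best = int(np.argmin(rmsecv)) + 1          # component count with smallest RMSECV
print("RMSECV per component count:", np.round(rmsecv, 3))
print("Selected number of components:", best)
```

Because every observation contributes equally to the squared-error sum here, a few high leverage points can dominate the criterion, which is the weakness the proposed robust versions are meant to address.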

 


 


 

*Corresponding author; email: sitizahariah@unikl.edu.my

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
